NGS Based Haplotype Assembly Using Matrix 
Completion 


1 ABSTRACT 

We use matrix completion methods for haplotype assembly from NGS reads to develop the new 
HapSVT, HapNuc, and HapOPT algorithms. This is performed by applying a mathematical model 
to convert the reads to an incomplete matrix and estimating unknown components. This is followed 
by quantizing and decoding the completed matrix in order to generate haplotypes. These algorithms 
are compared to the recently addressed SDhaP algorithm for the real fosmid data. It is shown that 
reconstruction rate and the MSE of these algorithms outperform the SDhaP. Also, the MEG score 
of the HapOPT is lower than that of the SDhaP with almost the same running time. 

Keyword : Haplotype assembly, Single nucleotide polymorphism, Computational biology,Minimum 
error correction, Matrix completion, Singular value decomposition. 

2 INTRODUCTION 

The Single Nucleotide Polymorphism (SNP) is a kind of genetic variation with a frequency greater 
than 1% in population. In diploid organisms, genomes are organized into pairs of chromosomes, a 
paternal and a maternal copy. The sequence of SNPs on each copy of a pair of chromosomes is called 
a haplotype. A genotype is the conflation of two haplotypes on the homologous chromosomes. An 
SNP is called homozygous, if a pair of alleles at this locus is made up of two identical nucleotides, 
and is heterozygous, otherwise. From the evolutionary point of view, the SNP has been occurred as 
a consequence of mutation. Since the rate of mutation is low, twice mutation of a specific locus is 
too rare. Thus, it is usual to assume that the majority of SNPs are bi-allelic which means that each 
SNP can be chosen from just two of the four possible nucleotides, i.e., A, T, C, and G [1]. Here, we 
similarly use the same assumption. The haplotype is widely used in the genome wide association 
studies, clinical genetics, linkage analysis, drug-design, and personalized medicine [2], 

To extract a haplotype, one may use the following three approaches where the last two ones are 
computational based: 

1) Applying high-cost experimental and expensive methods for every single individual which is not 
desirable [2], 

2) Haplotype phasing wherein the haplotypes are inferred from the genotypes of multiple individu¬ 
als. As such, a method based on the maximum parsimony assumption [3] and statistical methods 
like the SHAPEIT, developed based on the Hidden Markov Model (HMM) [1, 4] may be mentioned. 
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Note that using this approach, the haplotype of an individual can not be found separately and also 
is challenged by low-frequency variants and de novo variants [2], 

3) Estimating haplotypes from Next Generation Sequencing (NGS) reads i.e. the nucleotide sequence 
of fragments. Using this approach, mentioned as the haplotype assembly, haplotyping of a single 
individual becomes feasible. In this regard, HapCUT [5], Hap Tree [6], and HapSAT [7] are three 
famous methods which are developed based on probabilistic models. These methods are sensitive 
to the selected model and thus fragile to the model error. A recent method for haplotype assembly 
is the SDhaP [8] which has shown accurate results compared to the HapCut, HapTree, and ReFHap 

[9] . This heuristic method which makes use of correlation clustering and non-convex optimization 
does not guarantee reaching the global optimum. 

The innovation of this article is threefold. First, in Section 3 matrix completion is shown for 
haplotype assembly application in a mathematical form. Secondly, we propose three new algorithms 
including Haplotype assembly based on Singular Value Thresholding (HapSVT), Haplotype assem¬ 
bly based on Nuclear norm minimization (HapNuc), and Haplotype assembly based on OPTSPACE 
(HapOPT) in Section 4. 

In Section 5, these algorithms are evaluated using computer simulations and compared to the 
recent method, SDhaP, in terms of reconstruction rate, mean square error and minimum error 
correction score. Section 6 concludes the paper. 

3 MODEL OF HAPLOTYPES 

To exploit NGS reads as the raw data, a computational modeling is needed. To do so, first we 
convert the sequence of nucleotides which can be either reads or haplotypes into a sequence of 
numbers. The SNP nucleotides are converted to 1 and —1 for the wild and rare alleles, respectively 

[10] . As an example, Table 1 depicts the alleles of the /^AR, gene [3] for which the maternal and 
paternal haplotypes of an individual are shown by h m and h p , respectively. The corresponding 
codewords based on the above modeling are presented in the last column. 


Tab. 1: Haplotpyes of /3 2 AR genes and its corresponding code word. 



Nucleotides 

Code word 

Allels 

G/A 

C/A 

G/A 

C/G 

T/C 

T/C 

T/C 

G/A 

C/G 

G/A 

{1/-M/-1,...} 

hm 

A 

C 

G 

G 

C 

C 

C 

G 

G 

G 

{-1,1, l r l r l,-l r l, 1,-1,1} 

hp 

G 

c 

A 

C 

T 

T 

T 

A 

C 

G 

{ 1,1,-1, 1, 1, 1, 1,-1, 1,1} 


It is assumed that each read has been aligned to the reference genome. Then, the non-SNP sites 
of each read are omitted, and thus, the i-th read with the length of U has just the information of 
li sites from the whole l sites of the haplotype. Then, the i-th read is coded using the procedure 
described in Table 1 and is completed by adding zeros for the length of l as shown for 10 aligned 
reads in Table 2. For example, for the 1st row, we get {-1 11000000 0} which consists of 3 
sites of ±1 and 7 sites of zeros. 

Without loss of generality, by showing each row by the vector ry,i = 1,..., TV, we can generate 
the read matrix R where N is the number of reads as shown below. R is an incomplete version 
of H which consists of the maternal and paternal haplotypes as its rows, and thus, its rank is 
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Tab. 2: Example of aligned reads for /^AR genes and the considered code words. 


Reads 

Nucleotides 

Code words 

1 

A 

C 

G 








-1 

1 

1 

0 

0 

0 

0 

0 

0 

0 

2 



G 

G 

C 

C 





0 

0 

1 

-1 

-1 

-1 

0 

0 

0 

0 

3 



G 

G 





G 

G 

0 

0 

1 

-1 

0 

0 

0 

0 

-1 

1 

4 

G 

C 

A 

C 

T 

T 





1 

1 

-1 

1 

1 

1 

0 

0 

0 

0 

5 



A 

C 



T 

A 

C 

G 

0 

0 

-1 

0 

0 

1 

1 

-1 

1 

1 

6 

G 

C 



T 

T 





1 

1 

0 

0 

1 

1 

0 

0 

0 

0 

7 


C 



C 





G 

-1 

1 

0 

0 

-1 

0 

0 

0 

0 

1 

8 

A 

C 



C 

C 

C 




-1 

1 

0 

0 

-1 

-1 

-1 

0 

0 

0 

9 

G 



C 



T 

A 

C 


1 

0 

0 

1 

0 

0 

1 

-1 

1 

0 

10 



A 

C 





C 

G 

0 

0 

-1 

1 

0 

0 

0 

0 

1 

1 


2. Having obtained a low rank matrix, we may utilize matrix completion approach to solve the 
problem. Specifically, by estimating the zero entries of R, it becomes complete as shown by the 
matrix H which has the same dimension as R, i.e., N x l where l denotes the haplotype length. 


-1110000000 
0 0 1 -1 -1 -1 0 0 0 0 

001 -1 0 0 0 0 -1 1 
11-11 1 1 0 0 00 

00 -1 0 0 1 1 -1 11 
1 1 0 0 1 1 0 0 0 0 

-1 1 0 0 -1 0 0 0 0 1 

-1 1 0 0 -1 -1 -1 0 0 0 

100 1 0 0 1 -1 10 
00-11 0 0 0 0 11 

'-1 11 - 1-1 -1 -1 1 -1 1 
- 111-1 -1 -1 - 11-11 

- 111-1 -1 -1 - 11-11 

11-11 1 1 1-111 

11-11 1 1 1-111 

11-11 1 1 1-111 

- 111-1 -1 -1 - 11-11 

- 111-1 -1 -1 - 11-11 

11-11 1 1 1-111 

11-11 1 1 1-111 


( 1 ) 


( 2 ) 


According to the definition of H, only two of its rows are different and thus the desired haplo- 
types are given by 


h m =[-111 -1 -1 -1 -1 1-11], 

^=[11-11111-111], 


( 3 ) 

( 4 ) 
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Then, these vectors can be decoded to the sequence of nucleotides using the first row of Table 1. 
To the best of our knowledge, no algorithm has been reported to distinguish between the maternal 
and paternal haplotypes and therefore h v and h m may be interchanged with each other. 

4 PROPOSED METHODS 

Here, we present three algorithms for haplotype assembly whose general block diagram is illustrated 
in Figure 1. The goal is to estimate h p and h rn as the output. The first two parts of the block 
diagram, i.e., converting nucleotides to sequences of numbers and preparing a read matrix were 
explained in Section 3. 



Fig. 1 : Block diagram of the proposed algorithms. 

We consider an incomplete matrix R with a few known entries where the set of indices of known 
entries is given by Then, we intend to estimate the unknown entries based on rank assumption. 
Mathematically, this is defined by the following optimization problem: 

Find H subject to rank(iT) = r, H tJ — R { j for (5) 

Assuming r — 2, (1) can be converted to [11] 

min rank (H) subject to H VJ = R VJ for (i,j) G fh 


( 6 ) 
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To solve (2), three approaches named as nuclear norm minimization, Singular Value Threshold¬ 
ing (SVT), and OPTSPACE have already been reported [12]. Based on these algorithms, in the 
following we introduce three novel algorithms for the purpose of haplotype assembly mentioned as 
the HapSVT, HapNuc, and HapOPT. 

4.1 Haplotype assembly using HapSVT 

To explain the proposed HapSVT algorithm, we first introduce the SVT which is based on Singular 
Value Decomposition (SVD) [13] defined for the read matrix R as 

R = S = diag(cq) i = 1,..., r (7) 

where H denotes the hermitian operator, and U and V have orthonormal columns with the 
dimension of N x r and / x r, respectively. By applying the soft-thresholding operator, or namely, 
the singular value shrinkage operator D T (-) to H, we obtain 

Dr(H) = UD T {H)V H (8) 

where 

D T {11) = diag(max{(Tj — r, 0}). (9) 

It is worth noting that D T [H) is the optimal value of the optimization problem 

- Zf F + t\\Z\\, (10) 

where || ■ ||^ is the Frobenius norm and || • ||* shows the nuclear norm as the summation of singular 
values. 

To perform the matrix completion part as shown in Figure 1, we recursively use the SVT in 
two steps. In the first step, starting with the initial matrix Y° = R, the singular value shrinkage 
operator is computed as 


X k = D T {Y k ~ l ). (11) 

Then, in the second step, the difference between the projected matrix X k and the initial matrix 
is compensated for the known entries using 

Y k = Y k- 1 + SVn ( R _ X k ) (12) 

for k = 1, 2,..., where Vn(-) is an operator which keeps those entries of a matrix corresponding 
to unchanged, and sets the other entries to zero. The iterations continue until the condition 
\\ / Pci(X k — R)\\f < e|| R\\f is satisfied and the last X k is reported as the completed matrix. 

For the purpose of haplotype assembly, we need to find H by quantizing the entries of the 
completed matrix of SVT to 1 and -1 as shown in Figure 1. Then, the two different rows of H are 
considered as the paternal and maternal haplotypes. The procedures of the HapSVT algorithm is 
depicted in Algorithm 1. 
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Algorithm 1: Haplotype assembly using SVT (HapSVT). 
input : N aligned reads 
output: Haplotypes h m , h p 
/* preparing the read matrix */ 

1 Convert the sequences of nucleotides (reads) to the sequences 

of numbers 

2 Add zeros to each read to construct rqs with the length of l 

3 Construct the read matrix R (A x /) 

/* SVT algorithm */ 

4 Initialization Y° = R, k = 0, % = 1 

5 while \\V n (X k - i?)|| F < e||i?|| F do 

6 k — k + 1 

7 X k = D T {Y k ~ l ) 

8 Y k = Y k ~ l + 5Vn(R - x k ) 

9 end 

10 H — X k 

/* Quantization part */ 

11 // = 2 * (H > 0) — 1 

/* Extracting two different rows */ 

12 fi p = H( 1,:) 

13 while H( 1, :) = H(i, :) do 

14 | i — i + 1 
is end 

16 h m = H{i ,:) 

17 Convert the entries of h m and h p to the nucleotides. 

4.2 Haplotype assembly using HapNuc 

A popular method for matrix completion is based on relaxing the rank function to a convex function. 
Since the matrix rank is the number of nonzero singular values, an approximation of the rank 
function is given by the summation of the singular values, i.e., the nuclear norm. In this way, this 
problem is defined as [11] 

min ||JT||* subject to H tJ = R tJ for (i,j) € fh (13) 

H 

This problem can be solved easily using the CVX, a MATLAB based package [14]. It has been 
shown that nuclear norm minimization has strong mathematical guarantees to achieve the optimal 
solution [11, 15, 16]. To develop the second new HapNuc algorithm, we substitute the SVT part of 
Algorithm 1 by the nuclear norm minimization. 

4.3 Haplotype assembly using HapOPT 

Another approach for the matrix completion part is known as the OPTSPACE [17] in which unlike 
the two previous techniques, for the matrix completion part we assume that the rank of the desired 
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matrix H is known. The OPTSPACE algorithm consists of three steps including a) trimming, b) 
projection, and c) cleaning as explained below. 

a) In the trimming step, those columns of R with degrees larger than 2\Q\/l are set to zero where 
j • j shows the cardinality of a set and l is the haplotype length. The degree of a column or a row 
is defined as the number of its known entries. This step is also performed for the rows of R with 
degrees larger than 2\Q\/N where N is the number of reads. 

b) The trimmed R obtained from Step (a) is projected to the space of rank r matrices using 

P(R) = jT UP r (E)V H (14) 

where _P r (£) = diag(ai,...ay) and U and V are given by (7). 


c) The cleaning step is performed by solving the following optimization problem, 


min min V (Rij - (XSY 11 )^) 2 

Xm Nxr ,Y£R lxr SeR rxr 


(15) 


which contains two minimization parts. The inner part results in a function in terms of X and 
Y. To solve the outer minimization part, we use a gradient based recursive method whose initial 
matrices are computed from Step (b), i.e., X 0 = U and Y 0 = V. Then, this recursive method 
leads to the optimal solution as X opt S opt Y^ pt . 

To finalize the third new HapOPT algorithm, we should substitute the SVT part of Algorithm 
1 by the above three steps. 


5 RESULTS 

Using extensive simulations, we compare the performance of the proposed HapSVT, HapNuc, and 
HapOPT algorithms with that of SDhaP [8]. Simulations are performed for the haplotype data 
and NGS paired end reads addressed in [18]. The desired haplotypes are with the length of 100. 
The two metrics used for comparing the algorithms are the Mean Square Error (MSE) and the 
reconstruction rate (rr) defined as 

MSE = min{ \\h m - h m || 2 , || h m - h p || 2 } (16) 

rr = jrnm^RV[h m ,h m ),RV(h p ,hp)Y (17) 

where' HT >(•, •) is the augmented hamming distance between two vectors which counts the number 
of sites with identical values calculated as 


i 

KD{a, b) = J2 v { a U), b {j)) 


j=i 


(18) 
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where Z>(-, •) is 


V{a,b) 


0 a = b 
1 otherwise. 


(19) 


Figure 2 compares the MSEs for different number of reads for which the corresponding recon¬ 
struction rates are depicted in Figure 3. 




Fig. 2 : Mean square error vs. the number of Fig. 3: Reconstruction rate vs. the number of 
reads for different algorithms. reads for different algorithms. 

As seen in both figures, the proposed matrix completion-based algorithms outperform the SD- 
haP. Also, the HapNuc outperforms the other ones due to reformulating the problem in convex form 
with a nuclear norm definition which has been deeply investigated in [11], It is worth reminding 
that the SDhaP solves a non-convex optimization problem using a heuristic technique with the 
gradient descent algorithm which does not guarantee reaching the global optimum. Furthermore, 
as a consequence of increasing the number of reads, a better performance results by achieving a 
lower MSE and a higher reconstruction rate. 

The running time of the different algorithms is shown in Table 3. As observed, the HapOPT is 
faster than the other ones because OPTSPACE needs just an SVD and a quadratic optimization 
using a gradient-based method. Although the HapNuc needs more running time in CVX, it performs 
better than the others in terms of the MSE and the reconstruction rate. 


Tab. 3: Simulation time of the algorithms 


Algorithm 

50 reads 

80 reads 

HapOPT (Proposed) 

0.0566 

0.0292 

SDhaP 

0.0598 

0.0452 

HapS VT (Proposed) 

0.1088 

0.1562 

HapNuc (Proposed) 

1.7602 

2.3993 


Furthermore, the box plot of MSE and reconstruction rate over all experiments are depicted for 
each method in Figures 4 and 5, respectively. One can see that all the proposed algorithms generate 
lower variances compared to the SDhaP. 






MSE of HapSVT MSE of HapNuc 
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Fig. 4: Box plot of MSEs. 
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Fig- 5: Box plot of reconstruction rates. 


Moreover, we evaluate the proposed algorithms for the fosmid sequence data of 22nd chromosome 
of the individual NA12878 with the length of 1000 analyzed in [9]. Since for the real data no 
haplotype exists for comparison purposes, the Minimum Error Correction (MEC) score [19] is 
considered as another metric which shows the fidelity of the extracted haplotype to the reads. It is 
assumed that the lower the MEC score, the better the quality of the haplotype extraction [9]. The 
normalized MEC score is given by 
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MECscore = 


1 

m 


N 

^ min(hd(ri, h m ), hd(ry, h p )), 

i=i 


in which hd(r*, h p ) is the hamming distance between a and b defined as [8] 


( 20 ) 


where 


i 

hd(o, b) = yd(a(j),b{j )), 

3 =1 


d(a, b ) 


1 a^0&6^0&a^6 

0 otherwise. 


( 21 ) 


( 22 ) 


This metric for the fosmid data and the corresponding running time are shown in Table 4. As 
seen, among the proposed algorithms, the MEC score for the HapOPT is higher than that of the 
SDhaP. From the above MSE and MEC results, we can conclude that in total the HapOPT shows 
more accurate estimates of haplotypes with almost the same running time. 


Tab. 4: MEC for the Fosmid data. 


Algorithm 

HapNuc(Proposed) 

HapSVT(Proposed) 

HapOPT (Proposed) 

SDhaP 

MEC score 

0.002683 

0.002694 

0.000726 

0.001050 

Time (sec) 

58.04 

67.41 

2.21 

1.14 


6 CONCLUSION 

We have exploited matrix completion methods like the SVT, nuclear norm minimization, and 
OPTSPACE to estimate haplotypes more accurately. This was led to introducing the new HapNuc, 
HapOPT, and HapSVT algorithms. It was shown that the MSE and reconstruction rate of these 
algorithms outperform the recently addressed SDhaP algorithm. From the MEC score aspect, the 
HapOPT generated better results compared to the SDhaP for the real fosmid data with almost the 
same running time. 
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